docs: add a tutorial on how to write a new health monitor #1428

Merged
lalitadithya merged 4 commits into NVIDIA:main from nitz2407:nitijain/NKX-12166 on Jun 30, 2026

@nitz2407 nitz2407 commented Jun 26, 2026

Summary

Adds a complete, developer-facing guide for writing NVSentinel health monitors, along with its companion one-shot AI prompt.

Testing

Validated the one-shot AI prompt end-to-end on a local KIND cluster via Tilt.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • [] Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added a new “Writing a New Health Monitor” tutorial with end-to-end guidance to build, test, containerize, and deploy a health monitor (no GPU) that reports faults via a Unix-domain gRPC socket.
    • Documents the required platform connector health event contract, required vs. recommended fields, and key semantics for healthy/fatal behavior and recommended actions.
    • Includes local verification steps using Tilt/cluster tooling and required ruleset matching for emitted monitor events.
    • Expanded the documentation with a new tutorials section and a direct link to the guide.

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a README link to tutorials and a new end-to-end guide for writing, validating, packaging, and registering an NVSentinel health monitor, including protocol details, an example Go implementation, remediation rules, and CI/ko integration.

Changes

Health Monitor Tutorial

| Layer / File(s) | Summary |
|---|---|
| Tutorial entry and scope (docs/README.md, docs/tutorials/writing-a-health-monitor.md) | Adds the tutorials section in the README and introduces the tutorial purpose and monitor responsibilities. |
| Contract and rules (docs/tutorials/writing-a-health-monitor.md) | Documents the required gRPC API, HealthEvent field requirements, ProcessingStrategy semantics, and the identity, transition, clearing, and isFatal rules. |
| Demo monitor implementation (docs/tutorials/writing-a-health-monitor.md) | Shows the Go demo monitor with env-based configuration, Unix-socket gRPC dialing, trigger-file polling, edge-triggered emission, and event-building helpers. |
| Validation and image build (docs/tutorials/writing-a-health-monitor.md) | Describes the Tilt/KIND local workflow, the fault-quarantine ruleset, and the multi-stage Dockerfile for building the demo monitor image. |
| Production publisher and reference (docs/tutorials/writing-a-health-monitor.md) | Switches publishing to commons/pkg/healthpub, including socket gating, retries, a buildEvent test, and the appendix material for actions, paths, and prompt text. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through docs and found a tune,
A health monitor guide beneath the moon.
From socket to chart, the path is bright,
With trigger-file steps and guidance just right.
Thump thump—this burrow feels complete!

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly and concisely describes the new documentation tutorial added for writing a health monitor. |
Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 227-249: Validate the parsed POLL_INTERVAL_SECONDS value in the
health monitor startup flow before creating the ticker in the polling loop. In
the section that currently parses pollSeconds and later calls time.NewTicker,
add a guard to reject zero or negative values and return a clear error instead
of proceeding. Keep the fix near the existing POLL_INTERVAL_SECONDS handling so
the failure is caught before the ticker is constructed.
- Around line 520-530: The setup steps in the tutorial need to be reordered
because `go get github.com/nvidia/nvsentinel/commons@v0.0.0` will fail in a
fresh checkout until the local replacement is already in place. Update the
instructions in the `writing-a-health-monitor` section so the `go.mod` replace
for `github.com/nvidia/nvsentinel/commons` is added first (or via `go mod edit
-replace`) before running `go get`, keeping the dependency fetch step after the
replace is configured.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd89c363-9c5b-4a74-9877-39fa927e7df7

📥 Commits

Reviewing files that changed from the base of the PR and between 873f3ef and 5aed4a5.

📒 Files selected for processing (2)
  • docs/README.md
  • docs/tutorials/writing-a-health-monitor.md

@github-actions

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/janitor-provider/pkg/csp/generic | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/devicecounts | 58.85% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/initializer | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler | 52.69% (-1.15%) | 👎 |
| github.com/nvidia/nvsentinel/labeler/pkg/metrics | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator | 54.73% (ø) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler | 34.59% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/janitor-provider/pkg/csp/generic/generic.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/labeler/main.go | 0.00% (ø) | 152 | 0 | 152 | |
| github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/device_counts.go | 56.65% (ø) | 406 | 230 | 176 | |
| github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/resource_slices.go | 89.66% (ø) | 29 | 26 | 3 | |
| github.com/nvidia/nvsentinel/labeler/pkg/initializer/init.go | 0.00% (ø) | 52 | 0 | 52 | |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler.go | 58.74% (-1.29%) | 698 | 410 (-9) | 288 (+9) | 👎 |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler/resource_slice_events.go | 1.22% (ø) | 82 | 1 | 81 | |
| github.com/nvidia/nvsentinel/labeler/pkg/metrics/metrics.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/evaluator.go | 55.03% (ø) | 636 | 350 | 286 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/types.go | 33.33% (ø) | 9 | 3 | 6 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go | 34.59% (ø) | 1590 | 550 | 1040 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor-provider/pkg/csp/generic/generic_test.go
  • github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/device_counts_test.go
  • github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/resource_slices_test.go
  • github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/evaluator_integration_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler_integration_test.go

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch 2 times, most recently from 89c9f74 to ec0590a on June 26, 2026 12:45

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 15-19: The blockquotes in the tutorial markdown currently contain
blank lines that trigger markdownlint MD028, so tighten the copy in the affected
section and the repeated block around Appendix C by removing the internal blank
lines or splitting each quoted paragraph into separate blockquote blocks. Update
the relevant markdown text in the tutorial content so the quoted lines remain
readable but no longer have blank lines inside the same blockquote.
- Around line 263-265: The trigger-file check is treating all stat failures as
“healthy,” which can hide permission or I/O problems and clear an active fault;
update the fileExists helper and the demo check logic so only os.IsNotExist is
interpreted as the file being absent. Return the error from fileExists instead
of collapsing every os.Stat failure to false, and in the demo check path that
logs with slog.Info, propagate non-not-exist errors as unhealthy rather than
deriving healthy solely from fileExists. Apply the same fix to both occurrences
referenced by the demo snippets so the health monitor behavior is consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 62ef8d12-b0b4-4b72-9c3c-2e94b6c95a67

📥 Commits

Reviewing files that changed from the base of the PR and between 5aed4a5 and ec0590a.

📒 Files selected for processing (2)
  • docs/README.md
  • docs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
  • docs/README.md

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 64487fd to 02e66f5 on June 26, 2026 16:01

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/tutorials/writing-a-health-monitor.md (1)

40-40: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Align the socket path with the mount.

The diagram says the monitor talks to unix:///var/run/nvsentinel.sock, but the deploy step mounts hostPath /var/run/nvsentinel into /var/run. That only works if the socket actually lives under a nested directory, so readers will end up wiring the wrong path. Please make the mount and URI agree.

Also applies to: 368-369
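
One way to sanity-check that a mount and a socket URI agree is to map the host-side socket path through the mount and compare the result with the URI the monitor dials. The Go sketch below does this with a hypothetical helper, `containerPath`; the host-side file name `/var/run/nvsentinel/nvsentinel.sock` is an assumption made for illustration and is not confirmed by the tutorial.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// containerPath (hypothetical helper) maps a file on the host to the path
// it has inside a container, given a hostPath mount of hostDir at mountDir.
func containerPath(hostDir, mountDir, hostFile string) (string, error) {
	rel, err := filepath.Rel(hostDir, hostFile)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%s is not under the mounted directory %s", hostFile, hostDir)
	}
	return filepath.Join(mountDir, rel), nil
}

func main() {
	// Assumed layout: hostPath /var/run/nvsentinel mounted at /var/run.
	p, err := containerPath("/var/run/nvsentinel", "/var/run", "/var/run/nvsentinel/nvsentinel.sock")
	if err != nil {
		fmt.Println("mismatch:", err)
		return
	}
	// The monitor's target should equal this URI.
	fmt.Println("unix://" + p)
}
```

If the socket on the host sits outside the mounted directory, the helper returns an error, which corresponds to the mismatch the comment warns about.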

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tutorials/writing-a-health-monitor.md` at line 40, The diagram and
deploy instructions use mismatched socket paths, so update the health monitor
docs to make the mount target and the gRPC URI refer to the same location.
Adjust the wording and diagram in the writing-a-health-monitor tutorial around
the monitor-to-Platform Connector connection so the socket path shown for the
monitor matches the hostPath mount used in the deployment step.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8a3d558-7b55-443b-920b-297b665bae97

📥 Commits

Reviewing files that changed from the base of the PR and between ec0590a and 64487fd.

📒 Files selected for processing (1)
  • docs/tutorials/writing-a-health-monitor.md

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 02e66f5 to 1105a3a on June 26, 2026 16:06

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/tutorials/writing-a-health-monitor.md (1)

152-173: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Clarify the go.mod scaffold wording.

This says "two local replace directives", but the snippet only adds data-models here; commons is introduced later in section 8. As written, it reads like a missing step.

🛠 Suggested wording
-Create `go.mod` with the two local replace directives every monitor uses (the module
-path is resolved locally, not from a registry):
+Create `go.mod` with the local replace directive needed for the demo module (the
+module path is resolved locally, not from a registry):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tutorials/writing-a-health-monitor.md` around lines 152 - 173, The
`go.mod` scaffold wording is inconsistent with the snippet because it claims
“two local replace directives” even though this section only adds the
`data-models` replace and introduces `commons` later in section 8. Update the
tutorial text around the `go.mod` example to say this scaffold includes the
local replace for `data-models` only, and mention that the `commons` replace is
added later when the production publisher is introduced, so readers do not think
a step is missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 626-651: The publish example currently introduces fakePC but does
not use it, and the guidance about healthpub.New is incomplete. Update the
tutorial section around fakePC, TestBuildEvent_Unhealthy, and the Publish
example so the fake client is actually used when demonstrating delivery through
healthpub, or remove the fakePC mention entirely if only buildEvent is being
tested. Make the example consistent with the non-unix target behavior of
healthpub.New so the fakePC path is exercised instead of left unused.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15e923c0-04d0-4d5e-9098-d2553ea37bdd

📥 Commits

Reviewing files that changed from the base of the PR and between 64487fd and 02e66f5.

📒 Files selected for processing (2)
  • docs/README.md
  • docs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
  • docs/README.md

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 1105a3a to c7c3147 on June 26, 2026 17:26
nitz2407 added 2 commits June 26, 2026 23:04
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from c7c3147 to fb27671 on June 26, 2026 17:34
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from fb27671 to 294532a on June 27, 2026 07:18
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch 2 times, most recently from ae4c4a2 to 837cc0b on June 29, 2026 12:50
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch 2 times, most recently from be62467 to 47a1698 on June 29, 2026 15:14
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 8428c6e to b453a86 on June 30, 2026 04:49
@github-actions

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/labeler | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/initializer | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler | 55.89% (-1.06%) | 👎 |
| github.com/nvidia/nvsentinel/labeler/pkg/metrics | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/tests | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/labeler/main.go | 0.00% (ø) | 156 | 0 | 156 | |
| github.com/nvidia/nvsentinel/labeler/pkg/initializer/init.go | 0.00% (ø) | 52 | 0 | 52 | |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler.go | 62.56% (-1.19%) | 673 | 421 (-8) | 252 (+8) | 👎 |
| github.com/nvidia/nvsentinel/labeler/pkg/metrics/metrics.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_health_monitor_test.go

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 7691e45 to b453a86 on June 30, 2026 05:16
@lalitadithya lalitadithya enabled auto-merge (squash) June 30, 2026 05:24
@lalitadithya lalitadithya changed the title from "chore: add a tutorial on how to write a new health monitor" to "docs: add a tutorial on how to write a new health monitor" on Jun 30, 2026
@lalitadithya lalitadithya merged commit b5e4a6b into NVIDIA:main Jun 30, 2026
81 checks passed

2 participants